Preparing Bengali-English Code-Mixed Corpus for Sentiment Analysis of Indian Languages
Abstract
Analysis of the informative content and sentiments of social media users has been attempted quite intensively in the recent past. Most systems are usable only for monolingual data and fail or give poor results when applied to code-mixed data. To draw attention to this problem and encourage researchers to work on it, we prepared gold-standard Bengali-English code-mixed data with language and polarity tags for sentiment analysis. In this paper, we discuss the systems we built to collect and filter raw Twitter data. To reduce manual work during annotation, hybrid systems combining rule-based and supervised models were developed for both language and sentiment tagging. The final corpus was annotated by a group of annotators following a few guidelines. The resulting gold-standard corpus shows impressive inter-annotator agreement in terms of Kappa values. Various metrics such as the Code-Mixed Index (CMI) and Code-Mixed Factor (CF), along with various aspects (language and emotion), were also used to assess the code-mixed and sentiment properties of the corpus qualitatively.
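The abstract names the Code-Mixed Index but does not state its formula. As a rough illustration only, the sketch below computes a commonly used utterance-level form, 100 * (1 - max_lang / (n - u)), where n is the token count, u the number of language-independent tokens, and max_lang the token count of the dominant language. The tag names (`bn`, `en`, `univ`, `ne`, `undef`) and the function name are hypothetical, and the exact variant used for this corpus may differ.

```python
from collections import Counter

# Hypothetical tags for tokens that belong to no language
# (punctuation, named entities, undefined); an assumption, not the paper's tag set.
LANG_INDEPENDENT = {"univ", "ne", "undef"}


def code_mixing_index(tags):
    """Return an utterance-level Code-Mixed Index in [0, 100).

    `tags` holds one language tag per token. The score is 0 when the
    utterance is monolingual or has no language-specific tokens.
    """
    lang_counts = Counter(t for t in tags if t not in LANG_INDEPENDENT)
    lang_tokens = sum(lang_counts.values())  # n - u in the formula above
    if lang_tokens == 0:
        return 0.0
    dominant = max(lang_counts.values())  # tokens in the most frequent language
    return 100.0 * (lang_tokens - dominant) / lang_tokens


if __name__ == "__main__":
    # Three Bengali tokens, two English tokens, one language-independent token.
    print(code_mixing_index(["bn", "bn", "en", "univ", "en", "bn"]))
```

A higher score indicates a more even mix of languages within the utterance; a corpus-level figure is typically obtained by averaging over utterances.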
Similar resources
Sentiment Analysis of Code-Mixed Indian Languages: An Overview of SAIL_Code-Mixed Shared Task @ICON-2017
Sentiment analysis is essential in many real-world applications such as stance detection, review analysis, recommendation systems, and so on. Sentiment analysis becomes more difficult when the data is noisy and collected from social media. India is a multilingual country; people use more than one language to communicate among themselves. The switching between languages is called code-sw...
JU_KS@SAIL_CodeMixed-2017: Sentiment Analysis for Indian Code Mixed Social Media Texts
This paper reports on our work in the NLP Tool Contest @ICON-2017, shared task on Sentiment Analysis for Indian Languages (SAIL) (code mixed). To implement our system, we used a machine learning algorithm called Multinomial Naïve Bayes trained using n-gram and SentiWordnet features. We also used a small SentiWordnet for English and a small SentiWordnet for Bengali. But we have not ...
Revisiting Automatic Transliteration Problem for Code-Mixed Romanized Indian Social Media Text
Although automatic transliteration for Indian languages is a well-studied paradigm, available transliteration techniques fail in the Indian social media context due to phenomena such as wordplay, creative spelling, code-mixing, and phonetic romanized typing, all implying that transliteration for Indian social media text has to be revisited. The paper reports an initial study on automatic ...
Analyzing Roles of Classifiers and Code-Mixed factors for Sentiment Identification
Multilingual speakers often switch between languages to express themselves on social communication platforms. Sometimes, the original script of the language is preserved, while using a common script for all the languages is quite popular as well due to convenience. On such occasions, multiple languages are being mixed with different rules of grammar, using the same script which makes it a chall...
IIT-TUDA: System for Sentiment Analysis in Indian Languages Using Lexical Acquisition
Social networking platforms such as Facebook and Twitter have become very popular communication tools among online users for sharing and expressing opinions and sentiment about the surrounding world. The availability of such opinionated text content has drawn much attention in the field of Natural Language Processing. Compared to other languages, such as English, little work has been done for India...